AmpseqR is an R package for analysis of amplicon deep sequencing (AmpSeq) data generated on the Illumina platform. The pipeline offers various useful functions including Data pre-processing, Amplicon sequence variant (ASVs) estimation, Data post-processing, and Data visualization.
The html report include:
Data pre-processing, Amplicon sequence variant (ASVs) estimation, Data post-processing)The data used to generate this report is output by AmpSeqR, the run_data.rds data includes run_data.rds that contains marker_info, sample_manifest, demultiplexed, flt_reads, sub_reads, seq_tbl, seq_tbl_rm_indel, seq_tbl_end.
The dataset includes 40 samples.
The dataset includes 5 amplicon markers.
Marker names: Chr03, Chr05, Chr06, Chr07, Chr08
The pipeline provides an example to pull out how many reads were retained at various points of the pipeline (Data pre-processing, Amplicon sequence variant (ASVs) estimation, and Data post-processing). As a final check, this can track the reads through the pipeline.
Overview of read counts| Filtering Steps | Description |
|---|---|
| demultiplexed | Read count after demultiplexing |
| flt_reads | Read count after filtering the low quality score sequences |
| sub_reads | Down sampling per sample per amplicon data to certain number reads |
| seq_tbl | Read count after calling amplicon sequence haplotype, and remove the noise sequence |
| seq_tbl_rm_indel | Read counts after integrating sequences vary only for indels |
| seq_tbl_end | Read counts after removing failed samples and markers and analyzing sequences across the entire dataset |
The table represents the summary of the read counts at various filtering steps of the pipeline.
read_counts_tab(run)
The figure represents the proportion of sequences that are final retained from demultiplexed to seq_tbl_end.
read_counts_plot(run)
Followed we will look in more detail how many reads were dropped at various points of the pipeline.
Data pre-processingThe figure represents the remained reads per sample per marker after demultiplexing.
p <- run$demultiplexed %>%
filter(!(is.na(sample_id))) %>%
filter(!(is.na(marker_id))) %>%
ggplot(aes(sample, n, fill = marker_id)) +
geom_col() +
coord_flip() +
ggtitle('Demultiplexing counts')+
facet_wrap(~marker_id,nrow = 1)+
theme_light()+
theme(
legend.position = "none",
axis.text.x = element_text(angle = 90)
)
ggplotly(p)
The figure represents the summary of remained reads after demultiplexing by the amplicon marker.
p <- run$demultiplexed %>%
filter(!(is.na(sample_id))) %>%
filter(!(is.na(marker_id))) %>%
ggplot(aes(marker_id, n, fill = marker_id)) +
geom_boxplot() +
ggtitle('Demultiplexing counts')+
labs(y="Number of reads")+
theme_light()
ggplotly(p)
The table represents the remained reads per sample per marker after demultiplexing.
run$demultiplexed %>%
filter(!(is.na(sample_id))) %>%
filter(!(is.na(marker_id))) %>%
select(sample_id,marker_id,n,sample,info) %>%
DT::datatable(
extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('csv', 'excel'),
lengthMenu = list(c(10,25,50,-1),
c(10,25,50,"All")))
)
Filter and denoise readsThe figure represents the remained reads per sample per marker after filtering the low quality score sequences based on dada2::filterAndTrim function. Fail represents the failed reads. Pass represents the reads that passed the filter step.
p <- run$flt_reads %>%
mutate(fail = n_in - n_out, pass = n_out) %>%
select(sample_id, marker_id, pass, fail,sample) %>%
pivot_longer(c(pass, fail), names_to = 'status', values_to = 'n') %>%
ggplot(aes(sample, n, fill = status)) +
geom_col() +
coord_flip() +
facet_wrap(~ marker_id,nrow = 1,scales = "free_x") +
ggtitle('Read quality filtering (DADA2)')+
theme_light()+
theme(
axis.text.x = element_text(angle = 90)
)
ggplotly(p)
The figure represents the summary of remained reads after filtering the low quality score sequences by the amplicon marker. Fail represents the failed reads. Pass represents the reads that passed the filter step.
p <- run$flt_reads %>%
mutate(fail = n_in - n_out, pass = n_out) %>%
select(sample_id, marker_id, pass, fail,sample) %>%
pivot_longer(c(pass, fail), names_to = 'status', values_to = 'n') %>%
ggplot(aes(marker_id, n, fill = status)) +
geom_boxplot(aes(group=status)) +
ggtitle('Read quality filtering')+
labs(y="Number of reads")+
theme_light()
ggplotly(p) %>%
layout(boxmode = "group")
The table represents reads went in (n_in) and how many reads made it out (n_out) after filtering the low quality score sequences.
run$flt_reads %>%
mutate(fail = n_in - n_out) %>%
select(sample_id, marker_id, n_in, n_out, fail,sample,info) %>%
DT::datatable(
extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('csv', 'excel'),
lengthMenu = list(c(10,25,50,-1),
c(10,25,50,"All")))
)
Amplicon sequence variant (ASVs) estimation & Data post-processingHere, we check how many reads were dropped at Amplicon sequence variant (ASVs) estimation & Data post-processing steps.
Amplicon sequence variant (ASVs) estimation filters amplicon sequence haplotypes that are likely errors base on dada2 sample inference algorithm.
Data post-processing filters chimera sequences and maps amplicon sequence haplotypes to reference sequences to remove potentially erroneous amplicon sequence haplotypes and remove amplicon sequence haplotypes that do not meet thresholds (Parameters set before running AmpSeqR).
The amplicon sequence haplotype status:
| Type | Description |
|---|---|
| pass | Read count that passed the filter step |
| low_sample_count | Failed: The sample read count lower than the minimum number of read per sample per marker |
| low_asv_count | Failed: The amplicon sequence haplotype count lower than the minimum number of read for each haplotype |
| low_asv_freq | Failed: The amplicon sequence haplotype frequency lower than the minimum haplotype frequency |
| low_ident | Failed: The amplicon sequence haplotype similarity (map to reference) lower than the minimum sequence similarity |
| low_ident_z | Failed: The standardized amplicon sequence haplotype similarity (map to reference) lower than the minimum standardized sequence similarity |
| chimera | Failed: The amplicon sequence haplotype is chimera reads |
| multiple | Failed: multiple fail types |
The figure represents the passed reads and failed reads (different types) per sample per marker after Amplicon sequence variant (ASVs) estimation & Data post-processing steps.
status_pal <-
run$seq_tbl %>%
mutate(status = if_else(str_detect(status, ';'), 'multiple', status)) %>%
select(status) %>%
distinct() %>%
filter(status != 'pass') %>%
mutate(colour = scales::hue_pal()(n())) %>%
add_row(status = 'pass', colour = 'grey50') %>%
with(setNames(colour, status))
p <- run$seq_tbl %>%
mutate(status = if_else(str_detect(status, ';'), 'multiple', status)) %>%
ggplot(aes(sample, count, fill = status)) +
geom_col() +
scale_fill_manual(values = status_pal) +
coord_flip() +
facet_wrap(~marker_id,nrow = 1,scales = "free_x") +
ggtitle('ASV status filtering')+
theme_light()+
theme(
axis.text.x = element_text(angle = 90)
)
ggplotly(p)
The figure represents the summary of passed reads and failed reads (different types) after Amplicon sequence variant (ASVs) estimation & Data post-processing steps.
p <- run$seq_tbl %>%
mutate(status = if_else(str_detect(status, ';'), 'multiple', status)) %>%
ggplot(aes(status, count, fill = status)) +
geom_boxplot(aes(group=status)) +
scale_fill_manual(values = status_pal) +
ggtitle('ASV status filtering')+
facet_wrap(~marker_id,scale='free_x',shrink = FALSE)+
coord_flip()+
theme_light()+
theme(
axis.text.y = element_text(size=15)
)
p
The table represents the summary of the amplicon sequence haplotype status after Amplicon sequence variant (ASVs) estimation & Data post-processing steps.
run$seq_tbl %>%
select(sample_id, marker_id, count,status, ident, ident_z, sample,info) %>%
DT::datatable(
extensions = 'Buttons',
options = list(dom = 'Blfrtip',
buttons = c('csv', 'excel'),
lengthMenu = list(c(10,25,50,-1),
c(10,25,50,"All")))
)
The figure represents missing values present in the dataset.
The table represents marker missingness in the dataset.
marker_id | Missingness |
Chr03 | 0.075 |
Chr05 | 0.175 |
Chr06 | 0.200 |
Chr07 | 0.125 |
Chr08 | 0.150 |
A total of 37/(of 40) Samples remained for downstream analysis after filtering steps.
A total of 5/(of 5) Markers remained for downstream analysis after filtering steps.
The figure represents an overview of the read counts for amplicon markers of all samples in the dataset.
hap_read_counts_plot(run)
The diversity of each sample by different amplicons. This is measured by the count of SNP-based haplotypes for all amplicons.
The figure represents the unique haplotypes detected for each amplicon each sample.
unique_haplotypes_heatmap(run$seq_tbl_end)
The figure represents the average number of unique haplotypes detected for each amplicon in the dataset.
p <- run$seq_tbl_end %>%
group_by(sample,marker_id) %>%
summarise(n=n()) %>%
ungroup() %>%
ggplot(aes(x=marker_id, y=n, fill=marker_id)) +
geom_boxplot()+
geom_point(color="black", size=0.4, alpha=0.5,position=position_jitter(w=0.3,h=0.1),shape = 21,aes(label=sample)) +
theme_bw() +
xlab("Amplicon Marker")+
ylab("Number of distinct haplotypes")+
theme(
axis.text.x = element_text(size=16),
axis.text.y = element_text(size=16),
axis.title.x = element_text(size=17),
axis.title.y = element_text(size=17),
strip.text = element_text(size=17),
legend.text = element_text(size=15),
legend.title = element_text(size=16),
legend.position = "none"
)
ggplotly(p,tooltip = c("sample","n"))
| Metrics | Description |
|---|---|
| Haplotype | the unique haplotypes detected in dataset |
| He | expected heterozygosity |
| Size(bp) | the amplicon length |
| SNPs | the number of variable sites detected in dataset compared to the reference sequence |
| Ave_hap | the average number of unique haplotypes detected for each amplicon in dataset |
hapdiv_metrics(run$seq_tbl_end,run$marker_info)
Marker | Haplotype | He | Size(bp) | SNPs | Ave_hap |
Chr03 | 4 | 0.71 | 126 | 9 | 1.41 |
Chr05 | 18 | 0.94 | 100 | 7 | 2.27 |
Chr06 | 11 | 0.88 | 133 | 17 | 1.47 |
Chr07 | 11 | 0.87 | 118 | 44 | 1.51 |
Chr08 | 4 | 0.69 | 139 | 10 | 1.15 |
The sequence represents the unique haplotype sequence detected for each amplicon in dataset. Labeled by haplotype name (haplotype frequency%).
haplotype_sequence_vis(run$seq_tbl_end,run$marker_info)
20 40 60 80 100 120
'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''
Chr03-1(38.5%) ········A····-······T·A·········TTATAATTCCGCGC················GC··································TG··························· 126
Chr03-2(34.6%) ········A····-······T·A·········TTATAATTCCGCGC················GC··································CT··························· 126
Chr03-3(15.4%) ········A····-······T·A·········TTATAATTCCGCGC················TT··································TG··························· 126
Chr03-4(11.5%) ········G····C······G·G·········--------------················TT··································TG··························· 113
Reference ········A····-······T·A·········TTATAATTCCGCGC················GC··································CT··························· 126
Consensus GGAAGGGGRGTCA+CCCTGCKGRGGAACGCGC++++++++++++++ACATGAAAGTGCACATKYTTGCGTTTGTTCATGTGTGCGTTTTGTTTATGTAYKAATGTTCTTGTCGTATGTCTGCTTATA 127
20 40 60 80 100
'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'
Chr05-1(12%) -······GC······················································C···························TG····G··· 100
Chr05-2(9.3%) -······AT······················································A···························TT····G··· 100
Chr05-3(9.3%) ·······AT······················································A···························TT····G··· 101
Chr05-4(8%) -······GC······················································C···························CT····A··· 100
Chr05-5(8%) ·······GC······················································C···························CT····A··· 101
Chr05-6(8%) ·······GC······················································C···························TG····G··· 101
Chr05-7(5.3%) -······AT······················································C···························TG····A··· 100
Chr05-8(5.3%) -······GT······················································C···························CT····A··· 100
Chr05-9(5.3%) -······GT······················································C···························TT····G··· 100
Chr05-10(5.3%) ·······GT······················································C···························CT····A··· 101
Chr05-11(5.3%) ·······GT······················································C···························TT····G··· 101
Chr05-12(2.7%) -······AT······················································A···························CT····A··· 100
Chr05-13(2.7%) -······AT······················································C···························CT····A··· 100
Chr05-14(2.7%) -······GC······················································A···························TG····A··· 100
Chr05-15(2.7%) -······GC······················································C···························TG····A··· 100
Chr05-16(2.7%) ·······AT······················································A···························CT····A··· 101
Chr05-17(2.7%) ·······AT······················································C···························TG····A··· 101
Chr05-18(2.7%) ·······GC······················································C···························TG····A··· 101
Reference -······AT······················································A···························TT····G··· 100
Consensus CCATATGRYATATCCTTTTTTTTTTCTTTGTCATGGGCAGGCTTTACCGCTATTTTGTGTATTMTTTTCGAGTATCTCTTCTTTCTTTTAAYKCCAARCAG 101
20 40 60 80 100 120
'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|''''''
Chr06-1(27.7%) ·····AA·············A·····C·······-······TG··G·····················A···················G···A-···G·G·····-·A···G··T······················ 133
Chr06-2(12.8%) ·····CA·············C·····C·······-······TT··T·····················A···················A···CT···T·A·····G·T···G··G······················ 135
Chr06-3(8.5%) ·····CA·············A·····T·······-······TG··G·····················A···················A···CT···T·A·····G·T···G··G······················ 135
Chr06-4(8.5%) ·····CA·············C·····C·······-······CG··T·····················A···················G···A-···G·A·····-·A···G··T······················ 133
Chr06-5(8.5%) ·····CA·············C·····C·······-······CG··T·····················A···················G···A-···G·G·····-·A···G··T······················ 133
Chr06-6(8.5%) ·····CA·············C·····C·······-······CG··T·····················G···················G···A-···G·A·····-·A···G··T······················ 133
Chr06-7(8.5%) ·····CA·············C·····C·······-······TT··T·····················A···················G···AT···T·A·····G·T···G··G······················ 135
Chr06-8(4.3%) ·····AA·············A·····C·······-······TG··G·····················G···················G···A-···G·A·····-·A···G··T······················ 133
Chr06-9(4.3%) ·····AC·············A·····C·······A······TT··T·····················A···················G···A-···G·G·····-·A···G··T······················ 134
Chr06-10(4.3%) ·····AC·············A·····C·······-······TG··G·····················A···················G···A-···G·G·····-·A···G··T······················ 133
Chr06-11(4.3%) ·····CA·············C·····C·······-······TT··T·····················A···················A···CT···T·A·····G·T···A··G······················ 135
Reference ·····AC·············A·····C·······-······TT··T·····················A···················G···A-···G·A·····-·A···G··T······················ 133
Consensus ATACTMMTACGCTTAACTGCMCTCAAYAAGTTGC+TACTCCYKCGKTATGTACCTCTGCTAAATAGGRTGTACCTACAAATTGCTACRTTTM+TTTKCRCACCT+GWTTTRATKTACATTTTGCACGTTCCCATTT 136
20 40 60 80 100
'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''
Chr07-1(24.5%) ·AC··C······TGC-·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··AAG···A·A··AC·GC·G·····GCTA···TT·CT··· 118
Chr07-2(20.8%) ·AC··C······TGC-·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··ATA···C·A··AC·AC·G·····GCTA···TT·CT··· 118
Chr07-3(15.1%) ·GA··G······AAC-·G··A···············A·G·······AAAGC·AG····GG·AC··A·····T···G··A··TTA···G·C··GA·AG·A·····AACG···TT·TT··· 118
Chr07-4(9.4%) ·AC··C······TGA-·G··G···············C·G·······AGCAG·AG····TA·GA··A·····A···G··G··ATA···C·A··AC·AC·G·····ACCC···CG·TG··· 118
Chr07-5(7.5%) ·GA··G······AAC-·G··A···············A·G·······AAAGC·AG····GG·GC··A·····A···A··A··ATA···C·A··AC·AC·G·····GCTA···TT·CT··· 118
Chr07-6(7.5%) ·GA··G······AAC-·G··A···············A·G·······AACAC·TG····GA·GC··C·····T···G··G··AAG···G·C··AA·AG·G·····ACCC···TG·AT··· 118
Chr07-7(3.8%) ·AC··C······TGC-·C··A···············A·G·······AGCAG·AG····TA·GA··A·····A···G··G··GTA···C·A··AC·AC·G·····ACCC···CG·TG··· 118
Chr07-8(3.8%) ·AC··C······TGC-·C··A···············C·G·······AGCAG·AG····TA·GA··A·····A···A··A··AAG···A·A··GC·GC·G·····ACCC···CG·TG··· 118
Chr07-9(3.8%) ·GA··G······AAC-·G··A···············A·A·······AACAC·TG····GA·GC··C·····T···G··G··AAG···G·C··AA·AG·G·····ACCC···TG·AT··· 118
Chr07-10(1.9%) ·AC··C······TGCA·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··ATA···C·A··AC·AC·G·····GCTA···TT·CT··· 119
Chr07-11(1.9%) ·AC··T······TGCA·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··ATA···C·A··AC·AC·G·····GCTA···TT·CT··· 119
Reference ·AC··C······TGC-·C··A···············A·G·······CACAC·AA····GG·GC··A·····A···A··A··AAG···A·A··AC·GC·G·····GCTA···TT·CT··· 118
Consensus GRMTGBAGTGAAWRM+ASAGRTTAAGAAAGTCGAAGMTRATATTAAMRMRSAWRATGAKRARMTTMAAAAGWTAGRAARTGDWRCCAVTMAARMTRSTRAAAAGRMYVAATYKAHKGCC 119
20 40 60 80 100 120 140
'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'''''''''|'
Chr08-1(41%) ····G···CCA··········T········A········-······················T···T········A·····A·-························································· 139
Chr08-2(35.9%) ····G···CCA··········T········G········A······················T···T········A·····A·-························································· 140
Chr08-3(17.9%) ····G···CCA··········T········A········-······················G···C········G·····T·G························································· 140
Chr08-4(5.1%) ····A···---··········A········A········-······················G···C········G·····T·G························································· 137
Reference ····G···CCA··········T········A········-······················T···T········A·····A·-························································· 139
Consensus CGGGRGGT+++TCCATCAGATWTGCTACTTRTAGAATGG+CCAAGACAAATGACTCATCTGTKTAGYAGTGGTTARAGTGTWC+AAGAGAGGCACTTATTGAGCGATTGGCAAATTTCCATATTAGCTACGAAGATGCTTA 141
The figure represents the distinct haplotypes & SNP positions for each amplicon in dataset. The X-axis represents the SNP position and the Y-axis represents the haplotype name. The red color indicates that the haplotype contains SNPs at certain positions compared to the reference sequence.
hap_seq_sim(run$seq_tbl_end,run$marker_info)
This haplotype tree is created by integrating all major haplotype sequences (>50%) in each sample and perform multiple sequence alignment based on DECIPHER::AlignSeqs function to identify the similarity between samples (ape::nj).
The figure represents the haplotype tree, the amplicon markers represent the border color, the haplotype names represent the fill color (Due to there are a lot of haplotype names, colors of different markers may repeat).
hap_tree_plot(run$seq_tbl_end)
Since some samples are missing certain markers, the imputation are performed by replacing missing data within a variable by the mode (the most frequent value in each marker). The imputed values are colored in white.
hap_tree_plot_imputed(run$seq_tbl_end)
Principal component analysis is performed by integrating all major haplotype sequences (>50%) in each sample and perform multiple sequence alignment based on DECIPHER::AlignSeqs function to identify and analysis sample genotypes.
hap_pca(run$seq_tbl_end)
Since some samples are missing certain markers, the imputation are performed by replacing missing data within a variable by the mode (the most frequent value in each marker).
hap_pca_imputed(run$seq_tbl_end)